Now that we have seen in class the theory underlying generalized linear models, let’s learn how to fit them in R. For this lecture we will explore a few examples.
First let’s remember a few key concepts. Regression models assume the mean response (\(\mu_i\)) for an observation \(i\) depends on \(p\) explanatory variables \(x_1, x_2, \dots, x_p\) via some function \(f\) and a set of regression parameters \(\beta\), such that: \[ \mathbb{E}[y_i] = \mu_i = f(x_{1i}, x_{2i}, \dots, x_{pi}) \]
In linear models, it is assumed that the response variable (\(y_i\)) changes continuously and that errors are normally distributed around the mean. The expected value is then a linear (additive) combination of the parameters: \[ y_i \sim Normal(\mu_i, \sigma) \\ \mu_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} \]
However, in many cases some of these assumptions are not met and the response variable does not have support on the whole real line: for example, when a researcher has count data, binary responses, or only positive values. Additionally, errors might not be normally distributed and the variance might change with the mean, which violates the assumptions of a linear model. For such cases, we rely on generalized linear models (GLMs). In a generalized linear model, the expected value is some function of an additive combination of parameters.
For example, a glm for a binary response uses a Bernoulli distribution:
\[ \begin{align} Y \sim Bernoulli(p_i) \\ p_i = g^{-1}(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}) \\ g(p_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} \end{align} \]
where \(g()\) is a link function. In the case of a Bernoulli distribution, \(g\) can be the logit function: its inverse takes the linear predictor, which lives on the real line, and converts it to the probability scale, so that it can predict the value of the outcome variable.
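Concretely, the logit link and its inverse are:

\[ g(p) = \log\left(\frac{p}{1-p}\right), \qquad g^{-1}(\eta) = \frac{e^{\eta}}{1+e^{\eta}} \]

where \(\eta = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}\) is the linear predictor.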
A generalized linear model has 3 components: a random component, the probability distribution assumed for the response (e.g., normal, Bernoulli, Poisson); a systematic component, the linear predictor \(\eta = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\); and a link function \(g\) connecting the mean of the response to the linear predictor, \(g(\mu) = \eta\).
In 1975, Bickel and colleagues published a paper on the 1973 admission rates of graduate applicants at UC Berkeley. They were interested in investigating whether there were any gender biases in admission rates, as it seemed like women were suffering from lower acceptance rates. With that in mind, the authors wanted to dig deeper into this problem to better understand whether or not there was any evidence of gender bias. Let’s take a look at that data set.
library(tidyverse) # provides the dplyr, ggplot2, forcats, and tidyr functions used below
berk_data <- as.data.frame(UCBAdmissions)
berk_data$Admit <- fct_relevel(berk_data$Admit, "Rejected", "Admitted")
# Overall admission rate by gender
admi_by_gender <- berk_data %>% as_tibble() %>%
group_by(Admit, Gender) %>%
summarize(n = sum(Freq)) %>%
ungroup() %>%
group_by(Gender) %>%
mutate(prop = n/sum(n)) %>%
filter(Admit == "Admitted")
admi_by_gender
# Now visualizing the proportions with a barplot
ggplot(admi_by_gender, aes(x = Gender, y = prop, fill = Gender)) +
geom_col() +
geom_text(aes(label = round(prop, 3)), vjust = -1) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous("Acceptance (proportion)", limits = c(0,0.5)) +
theme_bw() +
guides(fill = "none")
It seems like women have a lower acceptance rate than men. But there is more information in the data set that we haven’t used yet. The data are separated by gender and department, so let’s look at acceptance rates by department.
# Admission rate by gender and department
admi_dept_gender <- berk_data %>% as_tibble() %>%
group_by(Gender, Dept) %>%
mutate(prop = Freq/sum(Freq)) %>%
filter(Admit == "Admitted")
admi_dept_gender
# Now visualizing the proportions with a barplot
ggplot(admi_dept_gender, aes(x = Gender, y = prop, fill = Gender)) +
geom_col() +
geom_text(aes(label = round(prop, 3)), vjust = -1) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous("Acceptance (proportion)", limits = c(0,1)) +
facet_wrap(~Dept)+
theme_bw() +
guides(fill = "none")
We can see that departments vary in their acceptance rates, meaning some departments are much harder to get into (like department F, which has an acceptance rate of less than \(0.1\)) compared with others (like department A). Another observation is that women actually have higher acceptance rates in all but 2 departments (C and E).
Now let’s do some model fitting and compare the results when looking at gender acceptance rates alone and when we control for the effect of department.
There are (at least) two ways to model these data, each providing a slightly different interpretation. One is to consider the probability that a given individual gets accepted at UC Berkeley, which is the result of a binary variable (whether the person is accepted or rejected). Another approach is to consider each acceptance as the outcome of an event and model the total number of accepted individuals. In both cases, gender and department might play a role in determining the acceptance outcome. Let’s find out.
If we opt for a Logistic Regression, we are considering binary outcomes (admitted versus rejected), and ultimately we want to know their probabilities. Alternatively, if we are modeling the number of events (number of accepted females and males) we would use a Poisson GLM.
For the logistic regression approach we need to de-aggregate the data.
Our first (null) model only considers the baseline probability of acceptance.
# De-aggregating the data with the function uncount()
berk_data_long <-
uncount(data = berk_data, weights = Freq) %>%
## Make "Admitted" the second level, so the model treats admission as the success (1)
mutate(Admit = factor(Admit, levels = c("Rejected", "Admitted")))
### Logistic regression #######
# Model 0: Only intercept, overall probability of admission
m0 <- glm(Admit ~ 1, data = berk_data_long, family = "binomial")
## Model summary
summary(m0)
As you can see in the model summary, this model estimates a single parameter. With it, we get the expected value of the logit of the overall admittance probability, which is the single parameter itself:
(m0.cf <- coef(m0))
To express this coefficient as a probability (of being admitted, in this case), we apply the inverse of the logit function to the single expected value this model provides for all observations:
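The code chunk for this step is not shown above; a minimal sketch, using the same inverse-logit formula applied later in this tutorial:

```r
## Inverse logit: convert the intercept (log-odds scale) to a probability
exp(m0.cf) / (1 + exp(m0.cf))
## Equivalently, using base R's logistic distribution function
plogis(m0.cf)
## Compare with the observed overall admission proportion
mean(berk_data_long$Admit == "Admitted")
```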
Now let’s include the information on gender.
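The chunk fitting this model is not displayed above; assuming the same de-aggregated data and binomial family as the null model, it would be:

```r
# Model 1: Admission as a function of gender
m1 <- glm(Admit ~ Gender, data = berk_data_long, family = "binomial")
summary(m1)
```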
Does it seem like there is a gender bias in acceptance rate? What is the direction of the bias?
Again, it might help to look at the probabilities of admittance predicted by this model. To do this, we first calculate the linear predictors for men and women:
# Fitted coefficients for model 1
(m1.cf <- coef(m1))
## Linear predictor for men
# For males, it is the first coefficient (they are the first level)
(m1.pM <- as.numeric(m1.cf[1]))
## Linear predictor for women
(m1.pW <- as.numeric(m1.cf[1] + m1.cf[2]))
These are the log-odds of admittance for each gender. We then apply the inverse logit function to express these numbers as admittance probabilities:
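That chunk is not shown above; a sketch applying the same inverse-logit formula used later for department F:

```r
## Predicted probability of admittance (inverse logit)
## Men
exp(m1.pM) / (1 + exp(m1.pM))
## Women
exp(m1.pW) / (1 + exp(m1.pW))
```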
Now let’s take a closer look and estimate the effect of both gender and department on the probability of admission.
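The fitting chunk is not displayed here; an additive specification consistent with the seven coefficients used below would be:

```r
# Model 2: Additive effects of gender and department
m2 <- glm(Admit ~ Gender + Dept, data = berk_data_long, family = "binomial")
summary(m2)
```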
This model predicts probabilities of being admitted for men and women in each department. For example, to get these probabilities for department “F” you can do the following:
# Fitted coefficients for model 2
(m2.cf <- as.numeric(coef(m2)))
## Expected log-odds by gender in department F
## Men
m2.pFM <- m2.cf[1] + m2.cf[7]
## Women
m2.pFW <- m2.cf[1] + m2.cf[2] + m2.cf[7]
## Predicted probabilities (inverse logit function)
## Men
exp(m2.pFM) / (1 + exp(m2.pFM))
## Women
exp(m2.pFW) / (1 + exp(m2.pFW))
Explore the probability of acceptance across the different departments. Do the same results you observed before still hold? What is the evidence of gender bias in acceptance at UC Berkeley in 1973?
When first looking at the data, it seemed like departments were admitting more men than women. With all departments combined, it seemed as if they were ‘discriminating’ against women. However, once we looked at the individual departments, the overall picture was quite different: when exploring the probability of acceptance by gender across departments, women had a slightly higher probability of acceptance in most departments.
One of the drivers behind this striking result is that women were more likely to apply to departments where fewer applicants could be admitted. Men, on the other hand, seemed to apply to departments that were less competitive.
Conclusions based on a large dataset can therefore be contradicted by conclusions from subgroups within the dataset. This phenomenon is known as Simpson’s Paradox, named after the British statistician Edward Simpson, who published an article in 1951 explaining the phenomenon. The paradox can be resolved when confounding variables and causal relations are appropriately addressed in the statistical modeling.
Going back to our Berkeley example, the authors of the study could not reach a conclusion regarding gender biases in acceptance. Rather, they realized there is a difference in the paths chosen by women and men when pursuing graduate studies. This actually opens up the possibility of asking further questions, such as:
Why do males tend to apply to engineering more often than females, and why is this reversed for the English department?
Why is it the case that the departments that tend to have a female-application bias also tend to have lower overall admission rates than those departments that have a male-application bias?
Might this not still reflect a gender bias, even though every single department is itself unbiased?
Another way to look at the same data and fit models to explore patterns of acceptance is through a Poisson Regression. With a Poisson regression we are interested in the number of events (counts), which in our case are the number of students admitted.
### Poisson regression #######
# Model 0: Admission status only - mean counts of rejected and admitted students
m0_poi <- glm(Freq ~ Admit, data = berk_data, family = "poisson")
This model predicts the overall number of admitted and rejected people. To get these expected values, calculate the linear predictors and then take their exponential, since the link function is the log:
m0_poi.cf <- as.numeric(coef(m0_poi))
## Values of the linear predictors
## Rejected
m0_poi.R <- m0_poi.cf[1]
## Admitted
m0_poi.A <- m0_poi.cf[1] + m0_poi.cf[2]
## Expected frequencies
## Rejected
exp(m0_poi.R)
## Admitted
exp(m0_poi.A)
## Compare with the observed frequencies
mean(subset(berk_data, Admit == "Rejected")$Freq)
mean(subset(berk_data, Admit == "Admitted")$Freq)
And here is the number of admitted women expected by this model, in a single R command:
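The original command is not shown; one possibility is to sum the model’s fitted values over the six rows corresponding to admitted women (note the model ignores gender, so this is simply six times the expected count per cell):

```r
sum(predict(m0_poi,
            newdata = subset(berk_data, Admit == "Admitted" & Gender == "Female"),
            type = "response"))
```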
Now you can explore the expected number of admitted students by gender and department. Compare the results you find with each modeling approach. Do they tell the same story? Do your conclusions still hold?
Now let’s fit the same data set under a Bayesian framework using the rethinking package. Again we want to evaluate gender biases in admission by asking whether there is an association between gender and probability of admission.
In our first model we are considering the Logistic Regression, which uses the Binomial Distribution. Our model can be written as:
\[ \begin{align} A_i \thicksim Binomial(N_i, \pi_i) \\ logit(\pi_i) = \beta_{gender_{[i]}} \\ \beta_{gender_{[i]}} \thicksim Normal(0, 1.5) \end{align} \]
where \(N_i\) indicates \(applications[i]\), the number of applications in row \(i\). The index variable \(gender[i]\) indexes the gender of an applicant: 1 indicates male, and 2 indicates female.
# With some massaging of the data we extract the admitted counts, the total applications, and the gender index
# Don't worry too much about how the data is massaged
library(rethinking)
dat_list <- list(
  admit = berk_data %>%
    filter(Admit == "Admitted") %>%
    pull(Freq),
  applications = berk_data %>%
    group_by(Dept, Gender) %>%
    mutate(total = sum(Freq)) %>%
    dplyr::slice(1) %>%
    pull(total),
  gid = berk_data %>%
    filter(Admit == "Admitted") %>%
    mutate(gid = ifelse(Gender == "Male", 1, 2)) %>%
    pull(gid)
)
m1_bayes <- ulam(
alist(
admit ~ dbinom( applications , p ) ,
logit(p) <- a[gid] ,
a[gid] ~ dnorm( 0 , 1.5 )
) , data=dat_list , chains=4 )
precis( m1_bayes , depth=2 )
The posterior for male applicants, \(a[1]\), is higher than that of female applicants. How much higher?
We need to compute the contrast.
post <- extract.samples(m1_bayes)
diff_a <- post$a[,1] - post$a[,2]
diff_p <- inv_logit(post$a[,1]) - inv_logit(post$a[,2])
precis( list( diff_a=diff_a , diff_p=diff_p ) )
The log-odds difference is certainly positive, corresponding to a higher probability of admission for male applicants. On the probability scale itself, the difference is somewhere between 12% and 16%.
Before moving on to speculate on the cause of the male advantage, let’s plot posterior predictions for the model. We’ll use the default posterior validation check function, postcheck, and then dress it up a little by adding lines to connect data points from the same department.
postcheck( m1_bayes )
# draw lines connecting points from same dept
for ( i in 1:6 ) {
x <- 1 + 2*(i-1)
y1 <- dat_list$admit[x]/dat_list$applications[x]
y2 <- dat_list$admit[x+1]/dat_list$applications[x+1]
lines( c(x,x+1) , c(y1,y2) , col=rangi2 , lwd=2 )
text( x+0.5 , (y1+y2)/2 + 0.05 , LETTERS[i] , cex=0.8 , col=rangi2 ) # departments A-F, in order
}
Those are pretty terrible predictions. There are only two departments in which women had a lower rate of admission than men (\(C\) and \(E\)), and yet the model says that women should expect to have a 14% lower chance of admission.
Again reinterpreting the data, the problem in this case is that men and women did not apply to the same departments, and departments vary in their rates of admission. This makes the answer misleading. You can see the steady decline in admission probability for both men and women from department A to department F. Women in these data tended not to apply to departments like A and B, which had high overall admission rates. Instead they applied in large numbers to departments like F, which admitted less than 10% of applicants.
So while it is true overall that women had a lower probability of admission in these data, it is clearly not true within most departments.
And note that just inspecting the posterior distribution alone would never have revealed that fact to us. We had to appeal to something outside the fit model. In this case, it was a simple posterior validation check.
Now we can actually go a bit further and ask different questions. What if instead of asking “What are the average probabilities of admission for women and men across all departments?” we want to ask “What is the average difference in probability of admission between women and men within departments?”
In order to ask the second question, we estimate unique female and male admission rates in each department. Here’s a model that asks this new question:
\[ \begin{align} A_i \thicksim Binomial(N_i, \pi_i) \\ logit(\pi_i) = \beta_{gender_{[i]}} + \beta_{department_{[i]}}\\ \beta_{gender_{[i]}} \thicksim Normal(0, 1.5) \\ \beta_{department_{[i]}} \thicksim Normal(0, 1.5) \end{align} \]
where \(department\) indexes each department. Each department \(k\) now gets its own log-odds of admission, \(\beta_k\), but the model still estimates universal adjustments for male and female applicants, which are the same across all departments.
We’ll use the indexing notation again to construct an intercept for each department. But first, we also need to construct a numerical index that numbers the departments \(1\) through \(6\).
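The fitting chunk is not shown here; a sketch following the model specification above, naming the gender parameters a (as in the first Bayesian model) and the department parameters delta (to match the precis output discussed below):

```r
## Numerical index for the departments: rows come in pairs (male, female) for A through F
dat_list$dept_id <- rep( 1:6 , each=2 )

m2_bayes <- ulam(
  alist(
    admit ~ dbinom( applications , p ) ,
    logit(p) <- a[gid] + delta[dept_id] ,
    a[gid] ~ dnorm( 0 , 1.5 ) ,
    delta[dept_id] ~ dnorm( 0 , 1.5 )
  ) , data=dat_list , chains=4 )

precis( m2_bayes , depth=2 )
```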
The intercept for male applicants, \(a[1]\), is now a little smaller on average than the one for female applicants. Let’s calculate the contrasts again, on both the relative and absolute scales:
post <- extract.samples(m2_bayes)
diff_a <- post$a[,1] - post$a[,2]
diff_p <- inv_logit(post$a[,1]) - inv_logit(post$a[,2])
precis( list( diff_a=diff_a , diff_p=diff_p ) )
If male applicants have it worse, it is only by a very small amount, about 2% on average.
Why did adding departments to the model change the inference about gender so much? The earlier figure gives you a hint—the rates of admission vary a lot across departments. Furthermore, women and men applied to different departments. Let’s do a quick tabulation to show that:
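The tabulation chunk is not displayed; a base-R sketch that produces the proportions described next (men in the top row, women in the bottom row):

```r
## Total applications by gender and department, as column proportions
apps <- xtabs( Freq ~ Gender + Dept , data = berk_data )
round( prop.table( apps , margin = 2 ) , 2 )
```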
These are the proportions of all applications to each department that come from men (top row) or women (bottom row). Department A receives 88% of its applications from men; department E receives 33%. Now look back at the delta posterior means in the precis output from m2_bayes. The departments with a larger proportion of female applicants are also those with lower overall admission rates.
Department is probably a confound in the sense that it misleads us about the direct causal effect. But it is not a confound in the sense of being a spurious association: it probably lies on a genuine causal path through which gender influences admission. Gender influences choice of department, and department influences the chance of admission.
Now let’s check the posteriors for Model 2:
postcheck( m2_bayes )
# draw lines connecting points from same dept
for ( i in 1:6 ) {
x <- 1 + 2*(i-1)
y1 <- dat_list$admit[x]/dat_list$applications[x]
y2 <- dat_list$admit[x+1]/dat_list$applications[x+1]
lines( c(x,x+1) , c(y1,y2) , col=rangi2 , lwd=2 )
text( x+0.5 , (y1+y2)/2 + 0.05 , LETTERS[i] , cex=0.8 , col=rangi2 ) # departments A-F, in order
}
Rethinking: Simpson’s paradox is not a paradox. Like most paradoxes, there is no violation of logic, just of intuition. And since different people have different intuitions, Simpson’s paradox means different things to different people. The poor intuition being violated in this case is that a positive association in the entire population should also hold within each department. Overall, women in these data did have a harder time getting admitted to graduate school. But that arose because women applied to the departments that were hardest for anyone, male or female, to gain admission to.
Perhaps a little more paradoxical is that this phenomenon can repeat itself indefinitely within a sample. Any association between an outcome and a predictor can be nullified or reversed when another predictor is added to the model, and the reversal can reveal a true causal influence or just be a confound.
All that we can do about this is to remain skeptical of models and try to imagine ways they might be deceiving us. Thinking causally about these settings usually helps.
The original paper analyzing the admissions case: Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398-404.
The Bayesian part of the tutorial comes from the excellent Statistical Rethinking: McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). CRC Press.